X-SepFormer: End-to-End Speaker Extraction Network With Explicit Optimization On Speaker Confusion
2023 · Kai Liu, Ziqing Du, Xucheng Wan, et al.
Abstract
Target speech extraction (TSE) systems are designed to extract target speech from a multi-talker mixture. Most prior TSE networks are trained to improve the reconstruction quality of the extracted speech waveform. However, it has been reported that a TSE system that delivers high reconstruction performance may still give a poor listening experience in practice. One such problem is wrong-speaker extraction, called speaker confusion (SC), which creates a strongly negative experience and hampers effective conversation. To mitigate the pressing SC issue, we reformulate the training objective and propose two novel loss schemes that use a reconstruction-improvement metric defined at the level of small chunks, together with the distribution of that metric. Both loss schemes encourage a TSE network to pay attention to SC chunks based on this distribution information. On this basis, we present X-SepFormer, an end-to-en
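To make the chunk-level idea in the abstract concrete, here is a minimal sketch of scoring an extracted signal chunk by chunk and flagging chunks where extraction made things worse. It is an illustration under stated assumptions, not the authors' loss: the abstract does not name the reconstruction metric, so SI-SDR improvement is assumed here, and the function names (`si_sdr`, `chunk_si_sdri`, `confusion_ratio`), the chunk length, and the 0 dB threshold are all hypothetical choices.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample lists."""
    dot = sum(e * t for e, t in zip(estimate, target))
    energy = sum(t * t for t in target) + eps
    scale = dot / energy
    # Project the estimate onto the target; the remainder counts as noise.
    projection = [scale * t for t in target]
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def chunk_si_sdri(mixture, estimate, target, chunk_len):
    """SI-SDR improvement over the mixture for each non-overlapping chunk.

    A trailing partial chunk is dropped.
    """
    scores = []
    for start in range(0, len(target) - chunk_len + 1, chunk_len):
        end = start + chunk_len
        ref = target[start:end]
        gain = si_sdr(estimate[start:end], ref) - si_sdr(mixture[start:end], ref)
        scores.append(gain)
    return scores


def confusion_ratio(scores, threshold=0.0):
    """Fraction of chunks whose improvement is below the (assumed) threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < threshold) / len(scores)


# Toy signals: two sinusoids stand in for the target and interfering speakers.
n, chunk_len = 800, 200
target = [math.sin(2 * math.pi * 5 * i / chunk_len) for i in range(n)]
interferer = [math.sin(2 * math.pi * 8 * i / chunk_len) for i in range(n)]
mixture = [t + v for t, v in zip(target, interferer)]

# Simulated extractor output: correct speaker in the first half,
# wrong speaker in the second half (a speaker-confusion episode).
estimate = [
    (t + 0.1 * v) if i < n // 2 else (v + 0.1 * t)
    for i, (t, v) in enumerate(zip(target, interferer))
]

scores = chunk_si_sdri(mixture, estimate, target, chunk_len)
ratio = confusion_ratio(scores)
print([round(s, 1) for s in scores], ratio)
```

In this toy case the first two chunks improve on the mixture by about 20 dB and the last two degrade by about 20 dB, so half of the chunks fall below the threshold. An utterance-level average would report roughly zero improvement and hide the fact that the second half contains the wrong speaker entirely, which is the motivation for looking at the per-chunk distribution.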
Related papers
- Target Confusion In End-to-end Speaker Extraction: Analysis And Approaches (2022) 9.59
- X-CrossNet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion (2024) 0.00
- Improving Curriculum Learning For Target Speaker Extraction With Synthetic Speakers (2024) 2.26
- Speaker-conditioning Single-channel Target Speaker Extraction Using Conformer-based Architectures (2022) 6.34
- X-TasNet: Robust And Accurate Time-domain Speaker Extraction Network (2020) 10.48
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023) 5.24
- Focus On The Sound Around You: Monaural Target Speaker Extraction Via Distance And Speaker Information (2023) 7.81
- Target Speech Extraction With Pre-trained Self-supervised Learning Models (2024) 9.41