Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings
2024 · He Zhao, Hangting Chen, Jianwei Yu, et al.
Abstract
Target speaker extraction (TSE) aims to extract the target speaker's voice from the input mixture. Previous studies have concentrated on highly overlapped scenarios. However, real-world applications often involve more complex conditions, such as variable speaker overlap and target speaker absence. In this paper, we introduce a framework to perform continuous TSE (C-TSE), comprising a target speaker voice activation detection (TSVAD) model and a TSE model. This framework significantly improves TSE performance on similar speakers and enhances personalization, which is lacking in traditional diarization methods. In detail, unlike conventional TSVAD, which is deployed to refine diarization results, the proposed Attention-target speaker voice activation detection (A-TSVAD) directly generates timestamps of the target speaker. We also explore different ways of integrating A-TSVAD and TSE by comparing cascaded and parallel methods. The framework's effectiveness is assessed using a range of metri
Related papers
- Target-speaker Voice Activity Detection Via Sequence-to-sequence Prediction (2022)
- SpeakerBeam-SS: Real-time Target Speaker Extraction With Lightweight Conv-TasNet And State Space Modeling (2024)
- Contextual Speech Extraction: Leveraging Textual History As An Implicit Cue For Target Speech Extraction (2025)
- 3S-TSE: Efficient Three-stage Target Speaker Extraction For Real-time And Low-resource Applications (2023)
- Typing To Listen At The Cocktail Party: Text-guided Target Speaker Extraction (2023)
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024)
- X-CrossNet: A Complex Spectral Mapping Approach To Target Speaker Extraction With Cross Attention Speaker Embedding Fusion (2024)
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025)