Profile-error-tolerant Target-speaker Voice Activity Detection
2023 · Dongmei Wang, Xiong Xiao, Naoyuki Kanda, et al.
Abstract
Target-Speaker Voice Activity Detection (TS-VAD) uses a set of speaker profiles alongside an input audio signal to perform speaker diarization. Although it has been shown to outperform conventional methods, it can suffer from errors in the speaker profiles, because those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing a transformer-based TS-VAD that can handle a variable number of speakers and by introducing a set of additional pseudo-speaker profiles to handle speakers left undetected by the first-pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-
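The pseudo-speaker idea in the abstract can be illustrated with a small, hedged sketch. It only shows the bookkeeping: profiles from a first-pass diarization are extended with extra pseudo-speaker slots, and a scorer emits one activity probability per frame for every slot. Everything named here is an assumption made for illustration. `append_pseudo_profiles` and `toy_activity` are hypothetical helpers, the random vectors stand in for real embeddings, the count of three pseudo profiles is arbitrary, and the dot-product scorer is a stand-in for the transformer-based model, whose details are not given in this abstract.

```python
import math
import random


def append_pseudo_profiles(estimated, pseudo):
    """Return first-pass profiles followed by the extra pseudo-speaker profiles."""
    dims = {len(p) for p in estimated + pseudo}
    if len(dims) > 1:
        raise ValueError("all profiles must share one embedding dimension")
    return estimated + pseudo


def toy_activity(frames, profiles):
    """Stand-in scorer (NOT the paper's model): sigmoid of the frame/profile
    dot product, giving a [num_frames][num_profiles] grid of probabilities."""
    return [
        [1.0 / (1.0 + math.exp(-sum(f * p for f, p in zip(frame, prof))))
         for prof in profiles]
        for frame in frames
    ]


rng = random.Random(0)
dim = 4  # toy embedding size, chosen only for the example

# Two speakers found by a first-pass clustering step (placeholder vectors).
estimated = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
# Three hypothetical pseudo-speaker slots for speakers the first pass missed.
pseudo = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
# Six toy frame embeddings standing in for the input audio.
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]

profiles = append_pseudo_profiles(estimated, pseudo)
activity = toy_activity(frames, profiles)

# One row per frame, one column per profile slot (estimated, then pseudo).
print(len(activity), len(activity[0]))
```

Because the profile list is an ordinary variable-length sequence, the same code runs unchanged when the first pass finds more or fewer speakers, which is the property the abstract attributes to the variable-speaker design.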
Related papers
- Target-speaker Voice Activity Detection With Improved I-vector Estimation For Unknown Number Of Speaker (2021)
- Target-speaker Voice Activity Detection Via Sequence-to-sequence Prediction (2022)
- Target Speaker Voice Activity Detection With Transformers And Its Integration With End-to-end Neural Diarization (2022)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020)
- Noise-robust Target-speaker Voice Activity Detection Through Self-supervised Pretraining (2025)
- Cross-channel Attention-based Target Speaker Voice Activity Detection: Experimental Results For M2MeT Challenge (2022)
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024)