Target-speaker Voice Activity Detection Via Sequence-to-sequence Prediction
2022 Β· Ming Cheng, Weiqing Wang, Yucong Zhang, et al.
Abstract
Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that can efficiently address the joint modeling of large-scale speakers and predict high-resolution voice activities. Experimental results show that larger speaker capacity and higher output resolution can significantly reduce the diarization error rate (DER), which achieves the new state-of-the-art performance of 4.55% on the VoxConverse test set and 10.77% on Track 1 of the DIHARD-III evaluation set under the widely-used evaluation metrics.
Authors
(none)
Tags
Stats
Related papers
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020)16.19
- Target-speaker Voice Activity Detection With Improved I-vector Estimation For Unknown Number Of Speaker (2021)10.97
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)0.00
- TS-SEP: Joint Diarization And Separation Conditioned On Estimated Speaker Embeddings (2023)10.35
- Profile-error-tolerant Target-speaker Voice Activity Detection (2023)6.77
- Target Speaker Voice Activity Detection With Transformers And Its Integration With End-to-end Neural Diarization (2022)10.48
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024)3.58
- Cross-channel Attention-based Target Speaker Voice Activity Detection: Experimental Results For M2met Challenge (2022)10.07