Multi-target Extractor And Detector For Unknown-number Speaker Diarization
2022 · Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, et al.
Abstract
Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines.
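The abstract reports gains as relative diarization error rate (DER) reductions. As a minimal sketch of how such a figure is usually computed, the helper below assumes the standard definition, (baseline DER - system DER) / baseline DER, expressed in percent. The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def relative_der_reduction(baseline_der: float, system_der: float) -> float:
    """Return the relative DER reduction in percent.

    Assumes the common definition: 100 * (baseline - system) / baseline.
    Both inputs are DER values in percent.
    """
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return 100.0 * (baseline_der - system_der) / baseline_der


# Hypothetical values for illustration only (not results from the paper):
# a baseline at 20.0% DER and a system at 15.0% DER.
example_reduction = relative_der_reduction(20.0, 15.0)
print(f"{example_reduction:.1f}% relative reduction")
```

Under this definition, a relative reduction says how much of the baseline's error was removed, so the same absolute gain counts for more against a stronger baseline.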
Related papers
- USED: Universal Speaker Extraction And Diarization (2023) 7.50
- Multi-scale Speaker Diarization With Neural Affinity Score Fusion (2020) 6.77
- Target-speaker Voice Activity Detection With Improved I-vector Estimation For Unknown Number Of Speaker (2021) 10.97
- Leveraging Speaker Embeddings In End-to-end Neural Diarization For Two-speaker Scenarios (2024) 0.00
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020) 16.19
- Combination Of Deep Speaker Embeddings For Diarisation (2020) 8.60
- EEND-SS: Joint End-to-end Neural Speaker Diarization And Speech Separation For Flexible Number Of Speakers (2022) 10.35
- Simultaneous Speech Recognition And Speaker Diarization For Monaural Dialogue Recordings With Target-speaker Acoustic Models (2019) 0.00