NTT Speaker Diarization System for CHiME-7: Multi-domain, Multi-microphone End-to-end and Vector Clustering Diarization
2023 · Naohiro Tawara, Marc Delcroix, Atsushi Ando, et al.
Abstract
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It then integrates the per-channel diarization results using diarization output voting error reduction plus overlap (DOVER-Lap). To harness knowledge from the target domain and from the results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC model with pseudo-labels derived from DOVER-Lap. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets, respectively, compared to the organizer-provided VC-based baseline diarization system, securing third place in di
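As a rough illustration of the channel-integration step mentioned in the abstract, the sketch below fuses per-channel speaker activities by frame-wise voting. It is not DOVER-Lap itself: it assumes speaker labels are already aligned across channels (DOVER-Lap performs its own label mapping) and it uses unweighted votes. The names `Activity` and `fuse_channels` are hypothetical and do not come from the paper.

```python
from typing import List

# Hypothetical representation: one channel's output as a frames x speakers
# grid of 0/1 activity flags, with speaker columns already aligned.
Activity = List[List[int]]


def fuse_channels(channels: List[Activity]) -> Activity:
    """Fuse speaker-aligned per-channel activities by frame-wise voting."""
    n_frames = len(channels[0])
    n_speakers = len(channels[0][0])
    fused: Activity = []
    for t in range(n_frames):
        # Votes per speaker: how many channels mark that speaker as active.
        votes = [sum(ch[t][s] for ch in channels) for s in range(n_speakers)]
        # Estimated number of active speakers: mean count across channels,
        # rounded half up.
        n_active = int(sum(votes) / len(channels) + 0.5)
        # Keep the n_active speakers with the most votes (ties: lower index).
        ranked = sorted(range(n_speakers), key=lambda s: (-votes[s], s))
        keep = set(ranked[:n_active])
        fused.append(
            [1 if s in keep and votes[s] > 0 else 0 for s in range(n_speakers)]
        )
    return fused


# Three channels, three frames, two speakers (toy data, not from the paper).
ch1 = [[1, 0], [1, 1], [0, 0]]
ch2 = [[1, 0], [1, 0], [0, 0]]
ch3 = [[0, 0], [1, 1], [0, 1]]
fused = fuse_channels([ch1, ch2, ch3])
```

In a self-supervised adaptation loop like the one the abstract describes, a fused output of this kind would play the role of the pseudo-labels used to retrain the per-channel model for a session.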
Related papers
- Neural Speaker Diarization Using Memory-aware Multi-speaker Embedding With Sequence-to-sequence Architecture (2023) 3.87
- Advances In Integration Of End-to-end Neural And Clustering-based Diarization For Real Conversational Speech (2021) 16.48
- Multi-channel End-to-end Neural Diarization With Distributed Microphones (2021) 10.21
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020) 16.19
- An Experimental Review Of Speaker Diarization Methods With Application To Two-speaker Conversational Telephone Speech Recordings (2023) 8.35
- Royalflush Speaker Diarization System For ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022) 0.00
- Microsoft Speaker Diarization System For The VoxCeleb Speaker Recognition Challenge 2020 (2020) 11.93
- Semi-supervised Multi-channel Speaker Diarization With Cross-channel Attention (2023) 2.26