Cross-channel Attention-based Target Speaker Voice Activity Detection: Experimental Results For M2MeT Challenge
2022 · Weiqing Wang, Xiaoyi Qin, Ming Li
Abstract
In this paper, we present the speaker diarization system submitted by team DKU_DukeECE to the Multi-channel Multi-party Meeting Transcription Challenge (M2MeT). Because the dataset contains highly overlapped speech, we employ an x-vector-based target-speaker voice activity detection (TS-VAD) model to detect overlapping speech between speakers. For the single-channel scenario, we train a separate model for each of the 8 channels and fuse the results. We also employ cross-channel self-attention to further improve performance, learning and fusing the non-linear spatial correlations between channels. Experimental results on the evaluation set show that the single-channel TS-VAD reduces the DER by over 75%, from 12.68% to 3.14%. The multi-channel TS-VAD reduces the DER by a further 28%, to 2.26%. Our final submitted system achieves a DER of 2.98% on the AliMeeting test set, ranking 1st in the M2MeT challenge.
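The relative reductions quoted in the abstract can be checked from the absolute DER values it reports. The snippet below is only an arithmetic sketch: the function and variable names are illustrative and do not come from the paper, and the numbers are the evaluation-set figures stated above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent, when a metric drops from `before` to `after`."""
    return (before - after) / before * 100.0


# DER values (in percent) as quoted in the abstract for the evaluation set.
# The variable names are illustrative, not taken from the paper.
initial_der = 12.68         # DER before applying TS-VAD
single_channel_der = 3.14   # DER with single-channel TS-VAD
multi_channel_der = 2.26    # DER with multi-channel TS-VAD

single_gain = relative_reduction(initial_der, single_channel_der)
multi_gain = relative_reduction(single_channel_der, multi_channel_der)

print(f"single-channel TS-VAD: {single_gain:.1f}% relative DER reduction")
print(f"multi-channel TS-VAD:  {multi_gain:.1f}% relative DER reduction")
```

Both percentages are relative to the preceding system, so the 28% figure is measured against the single-channel DER of 3.14%, not against the initial 12.68%.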
Related papers
- The USTC-Ximalaya System For The ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge (2022) 6.34
- The Volcspeech System For The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022) 5.84
- The Xmuspeech System For Multi-channel Multi-party Meeting Transcription Challenge (2022) 0.00
- The CUHK-TENCENT Speaker Diarization System For The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022) 7.81
- Royalflush Speaker Diarization System For ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022) 0.00
- Target-speaker Voice Activity Detection With Improved I-vector Estimation For Unknown Number Of Speaker (2021) 10.97
- Target-speaker Voice Activity Detection: A Novel Approach For Multi-speaker Diarization In A Dinner Party Scenario (2020) 16.19
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024) 0.00