Incorporating Spatial Cues In Modular Speaker Diarization For Multi-channel Multi-party Meetings
2024 · Ruoyu Wang, Shutong Niu, Gaobin Yang, et al.
Abstract
Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically, modular speaker diarization methods have seldom discussed how to leverage spatial cues from multi-channel speech. This paper proposes a three-stage modular system to enhance single-channel neural speaker diarization systems and recognition performance by utilizing spatial cues from multi-channel speech to provide more accurate initialization for each stage of neural speaker diarization (NSD) decoding: (1) Overlap detection and continuous speech separation (CSS) on multi-channel speech are used to obtain cleaner single speaker speech segments for clustering, followed by the first NSD decoding pass. (2) The results from the first pass initialize a complex Angular Central Gaussian Mixture Model (cACGMM) to estimate speaker-wise masks on multi-channel speec
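The hand-off described in stage (2), where first-pass diarization results initialize a mask-estimating mixture model, can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `init_masks_from_diarization`, the extra always-active noise class, the `floor` value, and the array shapes are all assumptions made for illustration. The sketch only builds initial class posteriors from a speaker-activity matrix; it does not fit a cACGMM.

```python
import numpy as np


def init_masks_from_diarization(activity, n_freq, floor=1e-3):
    """Turn frame-level speaker activity into initial mask posteriors.

    activity: array of shape (n_speakers, n_frames), binary or soft,
              e.g. taken from a first diarization pass (assumed input).
    n_freq:   number of frequency bins of the STFT (assumed).
    Returns an array of shape (n_speakers + 1, n_frames, n_freq) whose
    entries sum to 1 over the first axis; the last class is a
    hypothetical noise class that is active in every frame.
    """
    activity = np.asarray(activity, dtype=float)
    n_frames = activity.shape[1]

    # Append an always-active noise class (an assumption of this sketch).
    noise = np.ones((1, n_frames))
    weights = np.concatenate([activity, noise], axis=0)

    # Keep inactive speakers at a small nonzero weight so that a later
    # EM-style refinement could still revise the diarization decision.
    weights = np.maximum(weights, floor)

    # Normalize over classes so each frame holds a valid distribution.
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Use the same initial posterior for every frequency bin.
    return np.repeat(weights[:, :, np.newaxis], n_freq, axis=2)


# Toy usage: two speakers, three frames, with overlap in the middle frame.
toy_activity = [[1, 1, 0],
                [0, 1, 1]]
masks = init_masks_from_diarization(toy_activity, n_freq=4)
```

In this sketch a speaker marked inactive in a frame starts with a near-zero posterior there, so a mixture model initialized this way would begin from the diarization output rather than from a random assignment.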
Related papers
- Joint Training Of Speaker Embedding Extractor, Speech And Overlap Detection For Diarization (2024)
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024)
- Listening To Multi-talker Conversations: Modular And End-to-end Perspectives (2024)
- Integration Of Speech Separation, Diarization, And Recognition For Multi-speaker Meetings: System Description, Comparison, And Analysis (2020)
- Neural Speaker Diarization Using Memory-aware Multi-speaker Embedding With Sequence-to-sequence Architecture (2023)
- Multimodal Speaker Segmentation And Diarization Using Lexical And Acoustic Cues Via Sequence To Sequence Neural Networks (2018)
- Semi-supervised Multi-channel Speaker Diarization With Cross-channel Attention (2023)
- Exploring Speaker-related Information In Spoken Language Understanding For Better Speaker Diarization (2023)