Neural Speaker Diarization Using Memory-aware Multi-speaker Embedding With Sequence-to-sequence Architecture
2023 · Gaobin Yang, Maokui He, Shutong Niu, et al.
Abstract
We propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S). It integrates the strengths of memory-aware multi-speaker embedding (MA-MSE) and the sequence-to-sequence (Seq2Seq) architecture, improving both efficiency and performance. We further reduce the memory footprint of decoding by incorporating input feature fusion, and then employ a multi-head attention mechanism to capture features at different levels. NSD-MS2S achieved a macro diarization error rate (DER) of 15.9% on the CHiME-7 EVAL set, a relative improvement of 49% over the official baseline system, and was the key technique behind our best-performing entry in the main track of the CHiME-7 DASR Challenge. Additionally, we introduce a deep interactive module (DIM) in the MA-MSE module to retrieve a cleaner and more discriminative multi-speaker embedding, enabling the current model to outperform t
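The two headline numbers in the abstract (15.9% macro DER and a 49% relative improvement) can be related with a little arithmetic. The sketch below is illustrative only: it assumes the conventional definition of DER (missed speech, false alarm, and speaker confusion time divided by total reference speech time) and assumes "macro" means an unweighted mean over evaluation scenarios. The function names are hypothetical, and the baseline value it prints is only what the two quoted figures imply, not a number stated in the abstract.

```python
def der(missed, false_alarm, confusion, total_speech):
    """Conventional DER: summed error time over total reference speech time."""
    return (missed + false_alarm + confusion) / total_speech


def macro_der(per_scenario_der):
    """Assumed meaning of 'macro': unweighted mean of per-scenario DERs."""
    return sum(per_scenario_der) / len(per_scenario_der)


def relative_improvement(baseline, system):
    """Fraction by which the system's error is lower than the baseline's."""
    return (baseline - system) / baseline


def implied_baseline(system, rel_improvement):
    """Invert relative_improvement to recover the baseline it implies."""
    return system / (1.0 - rel_improvement)


# Figures quoted in the abstract: 15.9% macro DER, 49% relative improvement.
baseline = implied_baseline(15.9, 0.49)
print(round(baseline, 1))  # 31.2 (implied by the quoted figures, not reported)
```

Under these assumptions, a 49% relative improvement means the system's macro DER is roughly half that of the baseline, which is consistent with an implied baseline of about 31%.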
Related papers
- Semi-supervised Multi-channel Speaker Diarization With Cross-channel Attention (2023) [2.26]
- NTT Speaker Diarization System For CHiME-7: Multi-domain, Multi-microphone End-to-end And Vector Clustering Diarization (2023) [7.16]
- Sequence-to-sequence Neural Diarization With Automatic Speaker Detection And Representation (2024) [6.34]
- The Xmuspeech System For Multi-channel Multi-party Meeting Transcription Challenge (2022) [0.00]
- Multimodal Speaker Segmentation And Diarization Using Lexical And Acoustic Cues Via Sequence To Sequence Neural Networks (2018) [9.92]
- Incorporating Spatial Cues In Modular Speaker Diarization For Multi-channel Multi-party Meetings (2024) [4.52]
- Royalflush Speaker Diarization System For ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022) [0.00]
- Speaker Diarization Using Deep Recurrent Convolutional Neural Networks For Speaker Embeddings (2017) [9.41]