LS-EEND: Long-form Streaming End-to-end Neural Diarization With Online Attractor Extraction
2024 · Di Liang, Xiaofei Li
Abstract
This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND) method, which detects speaker activities in a frame-in-frame-out fashion. The proposed model mainly consists of a causal embedding encoder and an online attractor decoder. Speakers are modeled in the self-attention-based decoder along both the time and speaker dimensions, and frame-wise speaker attractors are automatically generated for new speakers and updated for existing speakers. A retention mechanism is employed and specially adapted for long-form diarization with linear temporal complexity. A multi-step progressive training strategy is proposed for gradually learning from easy tasks to hard tasks in terms of the number of speakers and audio length. Finally, the proposed model (referred to as long-form streaming EEND, LS-EEND) is able to perform streaming diarization for a high (up to 8) and flexible number of speakers and for very long (e.g., one hour) audio recordings. Experiments on vari
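The abstract's claim of linear temporal complexity can be illustrated with the generic recurrent form of retention, as introduced for Retentive Networks. The sketch below is not the LS-EEND implementation: the abstract says the mechanism is specially adapted for long-form diarization, and that adaptation is not described here. The function names, the single scalar decay `gamma`, and the omission of normalization and multi-head structure are all assumptions made for illustration. It shows that a frame-by-frame update with a fixed-size state yields the same outputs as attending over all past frames with exponential decay.

```python
def retention_recurrent(queries, keys, values, gamma):
    """Streaming form: one frame in, one output out, fixed-size state.

    Hypothetical helper for illustration; gamma is an assumed scalar decay.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # The state is a d_k x d_v matrix, so memory does not grow with time.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # S_n = gamma * S_{n-1} + k_n^T v_n
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k[i] * v[j]
        # o_n = q_n S_n
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def retention_parallel(queries, keys, values, gamma):
    """Reference form: frame n attends to every frame m <= n with weight
    gamma^(n-m) * (q_n . k_m). Cost grows quadratically with length."""
    d_v = len(values[0])
    outputs = []
    for n, q in enumerate(queries):
        out = [0.0] * d_v
        for m in range(n + 1):
            score = sum(a * b for a, b in zip(q, keys[m]))
            weight = (gamma ** (n - m)) * score
            for j in range(d_v):
                out[j] += weight * values[m][j]
        outputs.append(out)
    return outputs


# Small deterministic example: 5 frames, 2-dim keys/queries, 3-dim values.
frames = 5
qs = [[0.1 * (t + 1), -0.2 * t] for t in range(frames)]
ks = [[0.3 * t, 0.5 - 0.1 * t] for t in range(frames)]
vs = [[1.0 * t, 2.0 - t, 0.5] for t in range(frames)]

streamed = retention_recurrent(qs, ks, vs, gamma=0.9)
reference = retention_parallel(qs, ks, vs, gamma=0.9)
```

The recurrent version does a constant amount of work per frame, which is what makes frame-in-frame-out processing of hour-long recordings plausible. The parallel version is included only as a correctness reference.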
Related papers
- Frame-wise Streaming End-to-end Speaker Diarization With Non-autoregressive Self-attention-based Attractors (2023) 2.26
- Encoder-decoder Based Attractors For End-to-end Neural Diarization (2021) 13.05
- Speakers Unembedded: Embedding-free Approach To Long-form Neural Diarization (2024) 3.58
- BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization For A Variable Number Of Speakers (2020) 10.74
- Speech-aware Neural Diarization With Encoder-decoder Attractor Guided By Attention Constraints (2024) 0.00
- Online Neural Diarization Of Unlimited Numbers Of Speakers Using Global And Local Attractors (2022) 10.07
- Online Streaming End-to-end Neural Diarization Handling Overlapping Speech And Flexible Numbers Of Speakers (2021) 0.00
- Online End-to-end Neural Diarization With Speaker-tracing Buffer (2020) 10.74