DCF-DS: Deep Cascade Fusion Of Diarization And Separation For Speech Recognition Under Realistic Single-channel Conditions
2024 · Shu-Tong Niu, Jun Du, Ruo-Yu Wang, et al.
Abstract
We propose a single-channel Deep Cascade Fusion of Diarization and Separation (DCF-DS) framework for back-end automatic speech recognition (ASR), combining neural speaker diarization (NSD) and speech separation (SS). First, we sequentially integrate the NSD and SS modules within a joint training framework, enabling the separation module to leverage speaker time boundaries from the diarization module effectively. Then, to complement DCF-DS training, we introduce a window-level decoding scheme that allows the DCF-DS framework to handle the sparse data convergence instability (SDCI) problem. We also explore using an NSD system trained on real datasets to provide more accurate speaker boundaries. Additionally, we incorporate an optional multi-input multi-output speech enhancement module (MIMO-SE) within the DCF-DS framework, which offers further performance gains. Finally, we enhance diarization results by re-clustering DCF-DS outputs, improving ASR accuracy. By incorporating the DCF-DS me
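The cascade described in the abstract, where diarization boundaries guide separation and decoding runs window by window, can be illustrated with a minimal sketch. This is one plausible reading under stated assumptions, not the authors' implementation: the function names (`split_windows`, `cascade_decode`, `toy_separate`), the non-overlapping windows, the rule of skipping speakers who are silent within a window, and the toy separator are all hypothetical. A real system would use neural NSD and SS modules on audio features rather than lists of floats.

```python
from typing import Callable, List, Sequence, Tuple


def split_windows(num_frames: int, win: int) -> List[Tuple[int, int]]:
    """Non-overlapping (start, end) frame windows covering [0, num_frames)."""
    return [(s, min(s + win, num_frames)) for s in range(0, num_frames, win)]


def cascade_decode(
    mixture: Sequence[float],
    activity: Sequence[Sequence[int]],
    separate: Callable[[Sequence[float], Sequence[Sequence[int]]], List[List[float]]],
    win: int,
) -> List[List[float]]:
    """Run a separator window by window, guided by diarization output.

    activity[k][t] is 1 when speaker k is active at frame t (assumed to
    come from a diarization stage). Assumption: within each window, only
    speakers with at least one active frame are passed to the separator.
    """
    out = [[0.0] * len(mixture) for _ in activity]
    for start, end in split_windows(len(mixture), win):
        win_act = [a[start:end] for a in activity]
        active = [k for k, a in enumerate(win_act) if any(a)]
        if not active:
            continue  # nobody speaks in this window; leave zeros
        est = separate(mixture[start:end], [win_act[k] for k in active])
        for k, frames in zip(active, est):
            out[k][start:end] = frames
    return out


def toy_separate(mix, act):
    """Stand-in separator: share each frame equally among active speakers."""
    result = []
    for spk_act in act:
        frames = []
        for t, x in enumerate(mix):
            n_active = sum(a[t] for a in act)
            frames.append(x / n_active if spk_act[t] else 0.0)
        result.append(frames)
    return result


mixture = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0]
activity = [[1, 1, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0]]
streams = cascade_decode(mixture, activity, toy_separate, win=2)
```

In this sketch the per-speaker activity restricts which speakers the separator is asked to estimate in each window, so a speaker who is absent from a window contributes no output there.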
Related papers
- Neural Blind Source Separation And Diarization For Distant Speech Recognition (2024) · 0.00
- Integration Of Speech Separation, Diarization, And Recognition For Multi-speaker Meetings: System Description, Comparison, And Analysis (2020) · 13.23
- Low-latency Speech Separation Guided Diarization For Telephone Conversations (2022) · 6.77
- Neural Speaker Diarization Using Memory-aware Multi-speaker Embedding With Sequence-to-sequence Architecture (2023) · 3.87
- Incorporating Spatial Cues In Modular Speaker Diarization For Multi-channel Multi-party Meetings (2024) · 4.52
- Dualsep: A Light-weight Dual-encoder Convolutional Recurrent Network For Real-time In-car Speech Separation (2024) · 0.00
- An Efficient Speech Separation Network Based On Recurrent Fusion Dilated Convolution And Channel Attention (2023) · 0.00
- End-to-end Integration Of Speech Separation And Voice Activity Detection For Low-latency Diarization Of Telephone Conversations (2023) · 4.52