Leveraging Real Conversational Data For Multi-channel Continuous Speech Separation
2022 · Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, et al.
Abstract
Existing multi-channel continuous speech separation (CSS) models depend heavily on supervised data: either simulated data, which causes a mismatch between training and real-data testing, or real transcribed overlapping data, which is difficult to acquire. This dependence hinders further improvements in conversational/meeting transcription tasks. In this paper, we propose a three-stage training scheme for the CSS model that can leverage both supervised data and extra large-scale unsupervised real-world conversational data. The scheme consists of two conventional training approaches (pre-training using simulated data and ASR-loss-based training using transcribed data) and a novel continuous semi-supervised training between the two, in which the CSS model is further trained on real data within a teacher-student learning framework. We apply this scheme to an array-geometry-agnostic CSS model, which can use the multi-channel data collected from any microphone array
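The ordering of the three stages described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: `ToyModel` is a hypothetical one-parameter stand-in for a CSS network, a squared error replaces both the separation loss and the ASR loss, and the teacher is assumed to be a frozen snapshot of the pre-trained model.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for a CSS network: a single scalar gain."""
    w: float = 0.0

    def __call__(self, x: float) -> float:
        return self.w * x


def sgd_step(model, x, target, lr=0.1):
    # One gradient step on the squared error 0.5 * (w * x - target) ** 2.
    model.w -= lr * (model(x) - target) * x


def pretrain(model, simulated_pairs, epochs=100):
    # Stage 1: supervised training on simulated (input, reference) pairs.
    for _ in range(epochs):
        for x, ref in simulated_pairs:
            sgd_step(model, x, ref)


def semi_supervised(student, real_unlabeled, epochs=100):
    # Stage 2: teacher-student training on unlabeled real data.
    # Assumption: the teacher is a frozen copy of the pre-trained model.
    teacher = copy.deepcopy(student)
    for _ in range(epochs):
        for x in real_unlabeled:
            # The teacher output is the pseudo-target; no labels are used.
            sgd_step(student, x, teacher(x))
    return teacher


def finetune(model, transcribed_pairs, epochs=100):
    # Stage 3: stand-in for ASR-loss-based training on transcribed data.
    # A squared error replaces the real ASR loss in this toy.
    for _ in range(epochs):
        for x, target in transcribed_pairs:
            sgd_step(model, x, target)


model = ToyModel()
pretrain(model, [(1.0, 2.0), (0.5, 1.0)])            # simulated data
teacher = semi_supervised(model, [1.0, 0.5, 0.25])   # unlabeled real data
finetune(model, [(1.0, 2.5), (0.5, 1.25)])           # transcribed data
```

In this toy the student starts identical to the teacher and sees the same inputs, so stage 2 leaves the weights unchanged; it only shows the data flow, in which real recordings contribute through teacher outputs instead of labels.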
Related papers
- Conversational Speech Separation: An Evaluation Study For Streaming Applications (2022) 0.00
- Low-latency Speaker-independent Continuous Speech Separation (2019) 9.23
- Meeting Recognition With Continuous Speech Separation And Transcription-supported Diarization (2023) 6.77
- CONCSS: Contrastive-based Context Comprehension For Dialogue-appropriate Prosody In Conversational Speech Synthesis (2023) 0.00
- Emotion Rendering For Conversational Speech Synthesis With Heterogeneous Graph-based Context Modeling (2023) 13.15
- Intra- And Inter-modal Context Interaction Modeling For Conversational Speech Synthesis (2024) 4.53
- Investigation Of Practical Aspects Of Single Channel Speech Separation For ASR (2021) 7.81
- Directed Speech Separation For Automatic Speech Recognition Of Long Form Conversational Speech (2021) 2.26