Simultaneous Speech Extraction For Multiple Target Speakers Under The Meeting Scenarios
2022 · Bang Zeng, Hongbing Suo, Yulong Wan, et al.
Abstract
Conventional target speech separation directly estimates the target source, ignoring the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts each speaker's voice from the mixed speech rather than optimally estimating only a single target source. Moreover, we propose a speaker diarization (SD) aware MTSS system (SD-MTSS), which consists of an SD module and an MTSS module. By exploiting the TSVAD decision and the estimated mask, our SD-MTSS model can extract the speech signal of each speaker concurrently in a conversational recording without requiring enrollment audio in advance. Experimental results show that our MTSS model achieves improvements of 1.38 dB SDR, 1.34 dB SI-SDR, and 0.13 PESQ over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate (CER) on the AliMeeting dataset.
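The abstract reports gains in SI-SDR, so a minimal sketch of how that metric is commonly computed may help. It uses the standard scale-invariant SDR definition (project the estimate onto the reference, then compare target and residual energy). It is not the authors' evaluation code: the function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Illustrative only: the name, eps value, and mean removal are assumptions.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so a DC offset does not influence the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target counts as error.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual) + eps

    return 10.0 * math.log10(target_energy / residual_energy + eps)


# Toy signals: a sine "reference" plus a small interfering component.
reference = [math.sin(0.1 * i) for i in range(200)]
interference = [0.1 * math.cos(0.37 * i) for i in range(200)]
estimate = [r + n for r, n in zip(reference, interference)]
noisier = [r + 5.0 * n for r, n in zip(reference, interference)]
```

Because the target is rescaled before the comparison, multiplying `estimate` by any nonzero constant leaves the score essentially unchanged, which is why the metric is preferred over plain SDR when the output gain of a separation model is arbitrary. A reported improvement such as 1.34 dB is the difference between this score for the proposed system and for the baseline.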
Related papers
- TS-SEP: Joint Diarization And Separation Conditioned On Estimated Speaker Embeddings (2023)
- Simultaneous Speech Recognition And Speaker Diarization For Monaural Dialogue Recordings With Target-speaker Acoustic Models (2019)
- Integration Of Speech Separation, Diarization, And Recognition For Multi-speaker Meetings: System Description, Comparison, And Analysis (2020)
- The USTC-Ximalaya System For The ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge (2022)
- A Comparative Study On Speaker-attributed Automatic Speech Recognition In Multi-party Meetings (2022)
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024)
- Continuous Speech Separation Using Speaker Inventory For Long Multi-talker Recording (2020)
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023)