Time-domain Speech Extraction With Spatial Information And Multi Speaker Conditioning Mechanism
2021 · Jisi Zhang, Catalin Zorila, Rama Doddipatla, et al.
Abstract
In this paper, we present a novel multi-channel speech extraction system that simultaneously extracts multiple clean individual sources from a mixture in noisy and reverberant environments. The proposed method is built on an improved multi-channel time-domain speech separation network, which employs speaker embeddings to identify and extract multiple targets without label permutation ambiguity. To convey speaker information to the extraction model efficiently, we propose a new speaker conditioning mechanism that adds a dedicated speaker branch for receiving external speaker embeddings. Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline and increases speech recognition accuracy by more than 16% relative over the same baseline.
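The abstract reports its gains as relative improvements over a baseline. As a reading aid, the following minimal sketch shows how such a figure is commonly computed. The function name `relative_improvement` and all numeric values are hypothetical illustrations, not results from the paper, and the abstract does not state which metrics underlie the 9% and 16% figures.

```python
def relative_improvement(baseline: float, proposed: float,
                         higher_is_better: bool = True) -> float:
    """Return the relative improvement of `proposed` over `baseline`, in percent.

    For metrics where larger is better (e.g. a signal-to-distortion ratio),
    the gain is (proposed - baseline) / |baseline|. For error metrics where
    smaller is better (e.g. a word error rate), the sign is flipped.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    delta = proposed - baseline if higher_is_better else baseline - proposed
    return 100.0 * delta / abs(baseline)


# Hypothetical values chosen only to illustrate the arithmetic;
# they are NOT taken from the paper.
separation_gain = relative_improvement(10.0, 10.9)
recognition_gain = relative_improvement(25.0, 21.0, higher_is_better=False)

print(f"separation: {separation_gain:.1f}% relative")
print(f"recognition: {recognition_gain:.1f}% relative")
```

A relative figure depends on the baseline's absolute level, so the same percentage can correspond to very different absolute changes.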
Related papers
- Multi-channel Target Speech Extraction With Channel Decorrelation And Target Speaker Adaptation (2020)
- Speaker Conditioning Of Acoustic Models Using Affine Transformation For Multi-speaker Speech Recognition (2021)
- Multi-channel Speaker Verification For Single And Multi-talker Speech (2020)
- A Two-stage Speaker Extraction Algorithm Under Adverse Acoustic Conditions Using A Single-microphone (2023)
- Improving Channel Decorrelation For Multi-channel Target Speech Extraction (2021)
- VoiceFilter: Targeted Voice Separation By Speaker-conditioned Spectrogram Masking (2018)
- Speaker-conditioning Single-channel Target Speaker Extraction Using Conformer-based Architectures (2022)
- Speaker Reinforcement Using Target Source Extraction For Robust Automatic Speech Recognition (2022)