End-to-end Multi-channel Speaker Extraction And Binaural Speech Synthesis
2024 · Cheng Chi, Xiaoyu Li, Yuxuan Ke, et al.
Abstract
Speech clarity and spatial audio immersion are the two most critical factors in enhancing remote conferencing experiences. Existing methods are often limited: single-microphone approaches lack spatial information, while microphone-array approaches depend heavily on the accuracy of direction-of-arrival estimation. To overcome these limitations, we introduce an end-to-end deep learning framework that maps multi-channel noisy and reverberant signals directly to clean, spatialized binaural speech. The framework unifies source extraction, noise suppression, and binaural rendering in a single network. Within it, we propose a novel magnitude-weighted interaural level difference loss function to improve the accuracy of spatial rendering. Extensive evaluations show that our method outperforms established baselines in both speech quality and spatial fidelity.
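The abstract names a magnitude-weighted interaural level difference (ILD) loss but does not define it. The sketch below shows one plausible reading, given only to make the idea concrete: the ILD of a time-frequency bin is taken as the left/right magnitude ratio in dB, and each bin's ILD error is weighted by the reference magnitude so that low-energy bins contribute little. The function names, the choice of weight (left plus right reference magnitude), the absolute-error form, and the `EPS` constant are all assumptions of this sketch and are not taken from the paper.

```python
import math

EPS = 1e-8  # assumed small constant; avoids log(0) and division by zero


def ild_db(left_mag, right_mag):
    """Interaural level difference in dB for one time-frequency bin."""
    return 20.0 * math.log10((left_mag + EPS) / (right_mag + EPS))


def magnitude_weighted_ild_loss(est_left, est_right, ref_left, ref_right):
    """Weighted mean absolute ILD error over time-frequency bins.

    Each argument is a flat list of STFT magnitudes, one value per bin.
    The weight of a bin is the reference magnitude summed over both ears;
    this weighting is an assumption of the sketch, not the paper's definition.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for el, er, rl, rr in zip(est_left, est_right, ref_left, ref_right):
        weight = rl + rr
        weighted_error += weight * abs(ild_db(el, er) - ild_db(rl, rr))
        total_weight += weight
    return weighted_error / (total_weight + EPS)
```

Under this reading, an ILD error in a near-silent bin barely moves the loss, while the same error in a high-energy speech bin is penalized in full. In a training setup this term would presumably be combined with a signal-level reconstruction loss.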
Related papers
- End-to-end Dereverberation, Beamforming, And Speech Recognition With Improved Numerical Stability And Advanced Frontend (2021) 10.97
- Multi-channel Target Speech Extraction With Channel Decorrelation And Target Speaker Adaptation (2020) 0.00
- Multi-channel Speaker Verification For Single And Multi-talker Speech (2020) 0.00
- SE Territory: Monaural Speech Enhancement Meets The Fixed Virtual Perceptual Space Mapping (2023) 0.00
- Single-channel Multi-speaker Separation Using Deep Clustering (2016) 0.00
- Multi-geometry Spatial Acoustic Modeling For Distant Speech Recognition (2019) 6.34
- Learning-based Personal Speech Enhancement For Teleconferencing By Exploiting Spatial-spectral Features (2021) 6.34
- Efficient Multi-channel Speech Enhancement With Spherical Harmonics Injection For Directional Encoding (2023) 3.58