Deep Learning Based Audio-visual Multi-speaker DOA Estimation Using Permutation-free Loss Function
2022 · Qing Wang, Hang Chen, Ya Jiang, et al.
Abstract
In this paper, we propose a deep learning based method for multi-speaker direction of arrival (DOA) estimation from audio and visual signals using a permutation-free loss function. We first collect a data set for multi-modal sound source localization (SSL) in which both audio and visual signals are recorded in real-life home TV scenarios. We then propose a novel spatial annotation method that produces the ground-truth DOA of each speaker from the video data by transforming between camera coordinates and pixel coordinates according to the pinhole camera model. With spatial location information serving as an additional input alongside the acoustic features, multi-speaker DOA estimation can be solved as a classification task of active speaker detection. The label permutation problem common to multi-speaker tasks is thereby avoided, since the location of each speaker is used as input. Experiments conducted on both simulated and real data show that the proposed audio-visual DOA estimation model outperforms a
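The pinhole-model step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' annotation code: the function name `pixel_to_azimuth_deg` and the intrinsic parameters `fx` (focal length in pixels) and `cx` (principal-point column) are hypothetical, and the sketch assumes an undistorted image and a camera whose optical axis coincides with the 0° direction of the microphone array.

```python
import math


def pixel_to_azimuth_deg(u: float, cx: float, fx: float) -> float:
    """Horizontal angle, in degrees, of pixel column u relative to the optical axis.

    Assumed pinhole projection: u = fx * X / Z + cx, hence X / Z = (u - cx) / fx.
    Positive angles lie to the right of the image centre.
    """
    # atan2 recovers the angle whose tangent is X / Z = (u - cx) / fx.
    return math.degrees(math.atan2(u - cx, fx))


if __name__ == "__main__":
    # Hypothetical intrinsics for a 1920-pixel-wide image.
    fx, cx = 1000.0, 960.0
    for u in (960.0, 1960.0, 460.0):
        print(u, pixel_to_azimuth_deg(u, cx, fx))
```

Under these assumptions a speaker detected at the image centre maps to 0°, and a speaker offset from the centre by exactly `fx` pixels maps to 45°. A real setup would also need lens-distortion correction and the offset between the camera and the array, neither of which the abstract specifies.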
Related papers
- Multi-speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals (2018) 18.46
- Deep Learning Based Multi-source Localization With Source Splitting And Its Effectiveness In Multi-talker Speech Recognition (2021) 14.23
- Spatial Loss For Unsupervised Multi-channel Source Separation (2022) 7.16
- End-to-end Neural Speaker Diarization With Permutation-free Objectives (2019) 21.98
- Data Fusion For Audiovisual Speaker Localization: Extending Dynamic Stream Weights To The Spatial Domain (2021) 3.58
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024) 0.00
- Leveraging Visual Supervision For Array-based Active Speaker Detection And Localization (2023) 6.77
- Active Speaker Detection As A Multi-objective Optimization With Uncertainty-based Multimodal Fusion (2021) 7.50