Data Fusion For Audiovisual Speaker Localization: Extending Dynamic Stream Weights To The Spatial Domain
2021 · Julio Wissing, Benedikt Boenninghoff, Dorothea Kolossa, et al.
Abstract
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, w
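The region-wise weighting described in the abstract can be illustrated with a minimal sketch. It assumes the localization space is discretized into cells and that the audio and video trackers each supply a probability distribution over those cells. The log-linear combination rule, the function name `fuse_spatial_stream_weights`, and the hand-picked weights are assumptions made for illustration. In the paper the weights are predicted by a neural network, and its exact fusion rule is not given in this excerpt.

```python
import numpy as np


def fuse_spatial_stream_weights(p_audio, p_video, weights, eps=1e-12):
    """Fuse two discretized position distributions with per-cell weights.

    p_audio, p_video: shape (n_cells,), non-negative, each summing to 1.
    weights: shape (n_cells,), values in [0, 1]; the weight given to the
        audio stream in each cell (the video stream gets 1 - weight).
    Returns the fused distribution, shape (n_cells,), summing to 1.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Assumed log-linear rule: fused(x) ~ p_audio(x)^w(x) * p_video(x)^(1 - w(x))
    log_fused = weights * np.log(p_audio + eps) \
        + (1.0 - weights) * np.log(p_video + eps)

    # Subtract the maximum before exponentiating for numerical stability.
    log_fused -= log_fused.max()
    fused = np.exp(log_fused)
    return fused / fused.sum()


# Hypothetical example with five cells. The video tracker has a spurious
# peak in cell 0 (for instance, a poorly lit region), so the audio stream
# is fully trusted in cells 0 and 1 and both streams count equally elsewhere.
p_audio = np.array([0.05, 0.05, 0.10, 0.70, 0.10])
p_video = np.array([0.50, 0.10, 0.10, 0.20, 0.10])
weights = np.array([1.0, 1.0, 0.5, 0.5, 0.5])

fused = fuse_spatial_stream_weights(p_audio, p_video, weights)
print(fused, int(np.argmax(fused)))
```

Setting every weight to 1 reduces the result to the audio distribution, and setting every weight to 0 reduces it to the video distribution, so the per-cell weights interpolate between the two trackers region by region.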
Related papers
- Late Audio-visual Fusion For In-the-wild Speaker Diarization (2022) 3.58
- Audio-visual Speaker Diarization Based On Spatiotemporal Bayesian Fusion (2016) 14.51
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) 0.00
- Audio-visual Speaker Tracking: Progress, Challenges, And Future Directions (2023) 0.00
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) 0.00
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00
- Active Speaker Detection As A Multi-objective Optimization With Uncertainty-based Multimodal Fusion (2021) 7.50
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67