Attention-based Cross-modal Fusion For Audio-visual Voice Activity Detection In Musical Video Streams
2021 · Yuanbo Hou, Zhesong Yu, Xia Liang, et al.
Abstract
Many previous audio-visual voice-related works focus on speech and ignore the singing voice, despite the growing number of musical video streams on the Internet. Voice activity detection is a necessary step for processing diverse musical video data. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audio-visual information. To integrate information from the audio and visual modalities, a multi-branch network is proposed to learn audio and image representations; the representations are then fused by attention based on semantic similarity, so that the acoustic representations are shaped by the probability of anchor vocalization. Experiments show that the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating that cross-modal information fusion based on semantic correlation is sensible and successful.
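The fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' network: it assumes per-frame audio and visual embeddings of equal dimension already exist, uses cosine similarity as a stand-in for "semantic similarity", and maps it through a sigmoid to get a hypothetical probability of anchor vocalization that rescales the audio embedding. The function names and the `scale` parameter are illustrative assumptions.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + eps)


def fuse_frames(audio_frames, visual_frames, scale=5.0):
    """Weight each audio embedding by an audio-visual agreement score.

    Assumption (not from the paper): the attention weight for a frame is
    sigmoid(scale * cosine_similarity(audio, visual)), read here as a
    rough probability that the on-screen anchor is vocalizing.
    """
    fused, probs = [], []
    for a, v in zip(audio_frames, visual_frames):
        sim = cosine_similarity(a, v)
        p = 1.0 / (1.0 + math.exp(-scale * sim))
        probs.append(p)
        # Shape the acoustic representation with the attention weight.
        fused.append([p * x for x in a])
    return fused, probs


# Toy input: frame 0 agrees across modalities, frame 1 disagrees.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, -1.0]]
fused, probs = fuse_frames(audio, visual)
print(probs)
```

In this toy input the first frame keeps almost all of its audio energy while the second is strongly suppressed, which is the qualitative behavior the abstract attributes to fusing the modalities by semantic similarity. A trained model would learn the embeddings and the similarity function rather than fixing them as done here.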
Related papers
- Rule-embedded Network For Audio-visual Voice Activity Detection In Live Musical Video Streams (2020)
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022)
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020)
- Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction (2022)
- A Joint Cross-attention Model For Audio-visual Fusion In Dimensional Emotion Recognition (2022)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Active Speaker Detection As A Multi-objective Optimization With Uncertainty-based Multimodal Fusion (2021)