Rule-embedded Network For Audio-visual Voice Activity Detection In Live Musical Video Streams
2020 · Yuanbo Hou, Yi Deng, Bilei Zhu, et al.
Abstract
Detecting the anchor's voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) rely primarily on audio; however, audio-based VAD struggles to focus on the target voice in noisy environments. This paper proposes a rule-embedded network that fuses audio-visual (A-V) inputs so that the model can better detect the target voice. The rule coordinates the relation between the two modalities and uses visual representations as a mask to filter out non-target sound. Experiments show that: 1) with cross-modal fusion by the proposed rule, the detection result of the A-V branch outperforms that of the audio branch; 2) the bi-modal model far outperforms audio-only models, indicating that incorporating both audio and visual signals is highly beneficial for VAD. To attrac
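The masking idea in the abstract can be illustrated with a small sketch. This is not the paper's network or its actual rule: it assumes, purely for illustration, that an audio branch emits a per-frame voice probability, a visual branch emits a per-frame probability that the anchor is vocalizing, and the rule is an element-wise product. The function names, the 0.5 threshold, and the sample values are all hypothetical.

```python
import numpy as np


def mask_audio_with_visual(audio_prob, visual_prob):
    """Gate per-frame audio voice probabilities with a visual mask.

    Assumption (not from the paper): the rule is an element-wise product
    of two per-frame probability sequences of equal length.
    """
    audio_prob = np.asarray(audio_prob, dtype=float)
    visual_prob = np.asarray(visual_prob, dtype=float)
    if audio_prob.shape != visual_prob.shape:
        raise ValueError("audio and visual sequences must have the same shape")
    return audio_prob * visual_prob


def detect_target_voice(audio_prob, visual_prob, threshold=0.5):
    """Return a boolean per-frame decision for the target (anchor) voice.

    The threshold is a hypothetical choice for this sketch.
    """
    return mask_audio_with_visual(audio_prob, visual_prob) >= threshold


# Hypothetical values: frame 2 has voice-like audio, but the visual branch
# suggests the anchor is not vocalizing (for example, background vocals).
audio_prob = np.array([0.9, 0.8, 0.2, 0.7])
visual_prob = np.array([1.0, 0.1, 0.9, 0.9])
decisions = detect_target_voice(audio_prob, visual_prob)
```

Under these assumptions, a frame counts as target voice only when both modalities agree, which is one simple way a visual mask could suppress non-target sound.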
Related papers
- Attention-based Cross-modal Fusion For Audio-visual Voice Activity Detection In Musical Video Streams (2021) · 5.24
- Voice Activity Detection: Merging Source And Filter-based Information (2019) · 13.50
- MLNET: An Adaptive Multiple Receptive-field Attention Neural Network For Voice Activity Detection (2020) · 3.58
- X-vector Based Voice Activity Detection For Multi-genre Broadcast Speech-to-text (2021) · 0.00
- Personal VAD: Speaker-conditioned Voice Activity Detection (2019) · 13.05
- Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) · 0.00
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024) · 0.00
- Speech Enhancement Aided End-to-end Multi-task Learning For Voice Activity Detection (2020) · 11.49