A Study Of Dropout-induced Modality Bias On Robustness To Missing Video Frames For Audio-visual Speech Recognition
2024 · Yusheng Dai, Hang Chen, Jun Du, et al.
Abstract
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirel
Related papers
- Watch Or Listen: Robust Audio-visual Speech Recognition With Visual Corruption Modeling And Reliability Scoring (2023)
- Modality Dropout For Multimodal Device Directed Speech Detection Using Verbal And Non-verbal Features (2023)
- Quantifying Multimodal Imbalance: A GMM-guided Adaptive Loss For Audio-visual Learning (2025)
- Investigating Modality Bias In Audio Visual Video Parsing (2022)
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019)
- Push-pull: Characterizing The Adversarial Robustness For Audio-visual Active Speaker Detection (2022)
- Attribution Regularization For Multimodal Paradigms (2024)
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024)