Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions
2019 · Tejas Srinivasan, Ramon Sanabria, Florian Metze
Abstract
Multimodal learning allows us to leverage information from multiple sources (visual, acoustic and text), similar to our experience of the real world. However, it is currently unclear to what extent auxiliary modalities improve performance over unimodal models, and under what circumstances the auxiliary modalities are useful. We examine the utility of the auxiliary visual context in Multimodal Automatic Speech Recognition (MMASR) in adversarial settings, where we deprive the models of part of the audio signal at inference time. Our experiments show that while MMASR models show significant gains over traditional speech-to-text architectures (up to 4.2% WER improvement), they do not incorporate visual information when the audio signal has been corrupted. This shows that current methods of integrating the visual modality do not improve model robustness to noise, and that we need better visually grounded adaptation techniques.
Related papers
- Multimodal Speech Recognition With Unstructured Audio Masking (2020)
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022)
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)
- VILAS: Exploring The Effects Of Vision And Language Context In Automatic Speech Recognition (2023)
- Fine-grained Grounding For Multimodal Speech Recognition (2020)
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022)
- Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy? (2024)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)