Multimodal Speech Recognition With Unstructured Audio Masking
2020 · Tejas Srinivasan, Ramon Sanabria, Florian Metze, et al.
Abstract
Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where the masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.
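The abstract describes RandWordMask only at a high level: during training, any word segment in the audio may be masked. The following is a minimal sketch of that idea, not the authors' implementation. It assumes word-level frame alignments are available, that masking means overwriting the frames of a word with zeros, and that each word is masked independently with some probability; the function name `rand_word_mask`, the `mask_prob` parameter, and its default value are all hypothetical.

```python
import random


def rand_word_mask(frames, word_spans, mask_prob=0.2, rng=None):
    """Zero out the frames of randomly selected word segments.

    frames:     list of per-frame feature vectors (lists of floats).
    word_spans: list of (start_frame, end_frame) pairs, end exclusive,
                one per word (assumed to come from a forced alignment).
    mask_prob:  hypothetical per-word masking probability.
    Returns a masked copy of the frames and the indices of masked words.
    """
    if rng is None:
        rng = random.Random()

    # Copy so the original utterance features are left untouched.
    masked = [list(frame) for frame in frames]
    masked_words = []

    for word_idx, (start, end) in enumerate(word_spans):
        # Every word is a candidate, unlike a fixed masked-word list.
        if rng.random() < mask_prob:
            for t in range(start, min(end, len(masked))):
                # Assumption: masked audio is represented by all-zero frames.
                masked[t] = [0.0] * len(masked[t])
            masked_words.append(word_idx)

    return masked, masked_words
```

Applied afresh to each utterance on every training pass, a routine like this exposes the model to a different set of corrupted words each time, which is the contrast with the fixed-word masking used in earlier work.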
Related papers
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019)
- Fine-grained Grounding For Multimodal Speech Recognition (2020)
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022)
- Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR (2022)
- VILAS: Exploring The Effects Of Vision And Language Context In Automatic Speech Recognition (2023)
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018)
- End-to-end Multimodal Speech Recognition (2018)
- End-to-end Multi-talker Audio-visual ASR Using An Active Speaker Attention Module (2022)