Fine-grained Grounding For Multimodal Speech Recognition
2020 · Tejas Srinivasan, Ramon Sanabria, Florian Metze, et al.
Abstract
Multimodal automatic speech recognition systems integrate information from images to improve speech recognition quality, by grounding the speech in the visual context. While visual signals have been shown to be useful for recovering entities that have been masked in the audio, these models should be capable of recovering a broader range of word types. Existing systems rely on global visual features that represent the entire image, but localizing the relevant regions of the image will make it possible to recover a larger set of words, such as adjectives and verbs. In this paper, we propose a model that uses finer-grained visual information from different parts of the image, using automatic object proposals. In experiments on the Flickr8K Audio Captions Corpus, we find that our model improves over approaches that use global visual features, that the proposals enable the model to recover entities and other related words, such as adjectives, and that improvements are due to the model's abi
Related papers
- Multimodal Grounding For Sequence-to-sequence Speech Recognition (2018) 8.82
- Multimodal Speech Recognition With Unstructured Audio Masking (2020) 0.00
- Evaluation Of Audio-visual Alignments In Visually Grounded Speech Models (2021) 5.84
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019) 0.00
- Jointly Discovering Visual Objects And Spoken Words From Raw Sensory Input (2018) 14.27
- Transfer Learning From Audio-visual Grounding To Speech Recognition (2019) 9.59
- Large-scale Representation Learning From Visually Grounded Untranscribed Speech (2019) 10.48
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022) 9.03