Multimodal Grounding For Sequence-to-sequence Speech Recognition
2018 · Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, et al.
Abstract
Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, many works have studied the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we proposed a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system. This approach, however, is not end-to-end as it requires fine-tuning the whole model with an adaptation layer. In this paper, we propose novel end-to-end multimodal ASR systems and compare them to the adaptive approach by using a range of visual representations obtained from state-of-the-art convolutional neural networks. We show that adaptive training is effective for sequence-to-sequence (S2S) models, leading to an absolute improvement of
Related papers
- End-to-end Multimodal Speech Recognition (2018) 10.21
- Fine-grained Grounding For Multimodal Speech Recognition (2020) 5.84
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022) 9.03
- Listening While Speaking And Visualizing: Improving ASR Through Multimodal Chain (2019) 4.52
- Audio-visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks (2017) 17.39
- An Investigation Of End-to-end Multichannel Speech Recognition For Reverberant And Mismatch Conditions (2019) 0.00
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) 0.00
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019) 0.00