VILAS: Exploring The Effects Of Vision And Language Context In Automatic Speech Recognition
2023 · Ziyi Ni, Minglun Han, Feilong Chen, et al.
Abstract
Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have primarily focused on utilizing visual cues derived from human lip motions. In fact, context-dependent visual and linguistic cues can also benefit ASR in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately, to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, empirical results are reported on the public Flickr8K an
Related papers
- Listen, Look And Deliberate: Visual Context-aware Speech Recognition Using Pre-trained Text-video Representations (2020)
- Analyzing Utility Of Visual Context In Multimodal Speech Recognition Under Noisy Conditions (2019)
- Multimodal Speech Recognition With Unstructured Audio Masking (2020)
- Chinese-LiPS: A Chinese Audio-visual Speech Recognition Dataset With Lip-reading And Presentation Slides (2025)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024)
- SViQA: A Unified Speech-vision Multimodal Model For Textless Visual Question Answering (2025)
- Where Visual Speech Meets Language: VSP-LLM Framework For Efficient And Context-aware Visual Speech Processing (2024)