AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition
2024 · Zehua Liu, Xiaolou Li, Chen Chen, et al.
Abstract
Visual Speech Recognition (VSR) aims to recognize corresponding text by analyzing visual information from lip movements. Due to the high variability and weak information of lip movements, VSR tasks require effectively utilizing any information from any source and at any level. In this paper, we propose a VSR method based on audio-visual cross-modal alignment, named AlignVSR. The method leverages the audio modality as an auxiliary information source and utilizes the global and local correspondence between the audio and visual modalities to improve visual-to-text inference. Specifically, the method first captures global alignment between video and audio through a cross-modal attention mechanism from video frames to a bank of audio units. Then, based on the temporal correspondence between audio and video, a frame-level local alignment loss is introduced to refine the global alignment, improving the utility of the audio information. Experimental results on the LRS2 and CNVSRC.Single datase
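The two alignment steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scaled dot-product scoring, the absence of learned projections, and the use of a negative log-likelihood over attention weights as the local loss are all assumptions made for illustration. It also assumes one audio-unit label per video frame, whereas in practice the audio and video frame rates may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(video_frames, audio_units):
    """Global alignment (assumed form): each video frame attends over a bank
    of audio-unit embeddings using scaled dot-product scores.

    video_frames: T vectors of size d (queries)
    audio_units:  K vectors of size d (keys and values)
    Returns (weights, attended): T x K attention weights and T x d outputs.
    """
    d = len(audio_units[0])
    weights, attended = [], []
    for frame in video_frames:
        scores = [
            sum(f * u for f, u in zip(frame, unit)) / math.sqrt(d)
            for unit in audio_units
        ]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(w[k] * audio_units[k][j] for k in range(len(audio_units)))
             for j in range(d)]
        )
    return weights, attended


def local_alignment_loss(weights, unit_labels):
    """Local alignment (assumed form): mean negative log of the attention
    weight that each frame assigns to its time-aligned audio unit."""
    eps = 1e-12  # guards against log(0)
    total = sum(math.log(w[y] + eps) for w, y in zip(weights, unit_labels))
    return -total / len(unit_labels)


# Hypothetical toy data: a bank of 3 audio units and 2 video frames, d = 2.
audio_units = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_frames = [[2.0, 0.1], [0.1, 2.0]]

weights, attended = cross_modal_attention(video_frames, audio_units)
aligned = local_alignment_loss(weights, [0, 1])      # labels match the attention peaks
misaligned = local_alignment_loss(weights, [2, 2])   # labels point at the wrong unit
```

In this toy setup the loss is lower when the frame-level labels agree with the units the attention already favors, which is the sense in which a local loss could refine the global attention.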
Related papers
- SyncVSR: Data-efficient Visual Speech Recognition With End-to-end Crossmodal Audio Token Synchronization (2024) · 8.35
- Cross-modal Global Interaction And Local Alignment For Audio-visual Speech Recognition (2023) · 7.50
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020) · 10.97
- Evaluation Of Audio-visual Alignments In Visually Grounded Speech Models (2021) · 5.84
- Lip2Vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023) · 6.77
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) · 9.23
- Robust End-to-end Deep Audiovisual Speech Recognition (2016) · 0.00
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024) · 10.07