Predict-and-update Network: Audio-visual Speech Recognition Inspired By Human Speech Perception
2022 · Jiadong Wang, Xinyuan Qian, Haizhou Li
Abstract
Audio and visual signals complement each other in human speech perception, and they do so in speech recognition as well. As far as speech perception is concerned, the visual hint is less evident than the acoustic hint but more robust in a complex acoustic environment. How to effectively exploit the interaction between audio and visual signals for automatic speech recognition remains a challenge. Previous studies have exploited visual signals as redundant or complementary information to the audio input in a synchronous manner. Human studies suggest that the visual signal primes the listener in advance as to when, and on which frequency, to attend. We propose a Predict-and-Update Network (P&U net) to simulate such a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR). In particular, we first predict the character posteriors of the spoken words, i.e., the visual embedding, based on the visual signals. The audio signal is then conditioned on the visual embedding via a novel cross-modal
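The two-stage idea in the abstract (predict character posteriors from video, then condition the audio on them) can be illustrated with a minimal NumPy sketch. The abstract is cut off before it names the cross-modal mechanism, so the update step below swaps in plain scaled dot-product cross-attention with a residual connection as a stand-in. That choice, all function names, all weight matrices, and all dimensions are assumptions made for illustration and are not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_visual_embedding(visual_feats, w_char):
    # Predict step: per-frame character posteriors from visual features.
    # A single linear layer is a hypothetical placeholder for a visual front end.
    return softmax(visual_feats @ w_char, axis=-1)      # (T_v, n_chars)

def update_audio(audio_feats, visual_emb, w_q, w_k, w_v):
    # Update step (stand-in, NOT the paper's module): each audio frame
    # queries the visual embedding with scaled dot-product attention.
    q = audio_feats @ w_q                               # (T_a, d)
    k = visual_emb @ w_k                                # (T_v, d)
    v = visual_emb @ w_v                                # (T_v, d_a)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T_a, T_v)
    return audio_feats + attn @ v, attn                 # residual update

# Toy sizes, chosen arbitrarily: 25 video frames, 100 audio frames.
rng = np.random.default_rng(0)
T_v, T_a, d_v, d_a, n_chars, d = 25, 100, 64, 80, 30, 32
visual_feats = rng.standard_normal((T_v, d_v))
audio_feats = rng.standard_normal((T_a, d_a))
w_char = 0.1 * rng.standard_normal((d_v, n_chars))
w_q = 0.1 * rng.standard_normal((d_a, d))
w_k = 0.1 * rng.standard_normal((n_chars, d))
w_v = 0.1 * rng.standard_normal((n_chars, d_a))

visual_emb = predict_visual_embedding(visual_feats, w_char)
updated_audio, attn = update_audio(audio_feats, visual_emb, w_q, w_k, w_v)
```

Because the attention runs over all video frames, the audio and video streams may have different frame rates. A trained system would learn the weights and use a real visual front end in place of the random matrices used here.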
Related papers
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)
- How To Teach DNNs To Pay Attention To The Visual Modality In Speech Recognition (2020)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)
- ReVISE: Self-supervised Speech Resynthesis With Visual Input For Universal And Generalized Speech Enhancement (2022)
- AKVSR: Audio Knowledge Empowered Visual Speech Recognition By Compressing Audio Knowledge Of A Pretrained Model (2023)
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024)
- Dual-path Cross-modal Attention For Better Audio-visual Speech Extraction (2022)
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022)