Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
2023 · Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, et al.
Abstract
Visual Speech Recognition (VSR) differs from common perception tasks in that it requires deeper reasoning over the video sequence, even by human experts. Despite recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models to predict the target speech. This hinders their ability to generalize beyond the training set and leads to performance degradation in challenging out-of-distribution scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to the corresponding latents of the paired audio, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed
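The latent-to-latent mapping described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the prior is reduced to a single linear map, the paired video and audio latents are synthetic random vectors, and the objective is a plain mean-squared error between predicted and target audio latents (the abstract does not state the loss). The names `make_pairs`, `apply_prior`, `mse`, and `train_prior` are hypothetical. In the setting the abstract describes, the visual encoder and the ASR model would be pretrained networks and only the prior would be trained.

```python
import random


def make_pairs(n, dim, rng):
    """Build toy (video latent, audio latent) pairs.

    Assumption: the audio latent is a fixed, hidden linear function of the
    video latent. Real latents would come from pretrained encoders.
    """
    hidden = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    pairs = []
    for _ in range(n):
        z_v = [rng.uniform(-1, 1) for _ in range(dim)]
        z_a = [sum(h * x for h, x in zip(row, z_v)) for row in hidden]
        pairs.append((z_v, z_a))
    return pairs


def apply_prior(weights, z_v):
    """Map one video latent to a predicted audio latent (linear prior)."""
    return [sum(w * x for w, x in zip(row, z_v)) for row in weights]


def mse(weights, pairs):
    """Mean-squared error between predicted and target audio latents."""
    total, count = 0.0, 0
    for z_v, z_a in pairs:
        pred = apply_prior(weights, z_v)
        total += sum((p - t) ** 2 for p, t in zip(pred, z_a))
        count += len(z_a)
    return total / count


def train_prior(pairs, dim, lr=0.1, epochs=200):
    """Fit the linear prior with per-sample gradient descent on squared error."""
    weights = [[0.0] * dim for _ in range(dim)]
    for _ in range(epochs):
        for z_v, z_a in pairs:
            pred = apply_prior(weights, z_v)
            for i in range(dim):
                err = pred[i] - z_a[i]
                for j in range(dim):
                    weights[i][j] -= lr * err * z_v[j]
    return weights


rng = random.Random(0)
pairs = make_pairs(n=64, dim=4, rng=rng)
initial_loss = mse([[0.0] * 4 for _ in range(4)], pairs)
prior = train_prior(pairs, dim=4)
final_loss = mse(prior, pairs)
print(f"latent MSE before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

At inference time, the output of `apply_prior` would stand in for the audio latent and be handed to a frozen ASR decoder. That step is omitted here because it depends on the specific ASR model used.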
Related papers
- LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data (2023) 5.84
- AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition (2024) 0.00
- LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading (2023) 3.89
- SynthVSR: Scaling Up Visual Speech Recognition with Synthetic Supervision (2023) 9.76
- LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition (2024) 2.26
- LA-VocE: Low-SNR Audio-Visual Speech Enhancement Using Neural Vocoders (2022) 0.00
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading (2021) 11.39
- RobustL2S: Speaker-Specific Lip-to-Speech Synthesis Exploiting Self-Supervised Representations (2023) 4.52