LiRA: Learning Visual Speech Representations From Audio Through Self-supervision
2021 · Pingchuan Ma, Rodrigo Mira, Stavros Petridis, et al.
Abstract
The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences
Related papers
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020)
- LiteVSR: Efficient Visual Speech Recognition By Learning From Speech Representations Of Unlabeled Data (2023)
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023)
- Chinese-LIPS: A Chinese Audio-visual Speech Recognition Dataset With Lip-reading And Presentation Slides (2025)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025)