SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision
2023 · Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, et al.
Abstract
Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and could be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the la
Related papers
- SyncVSR: Data-efficient Visual Speech Recognition With End-to-end Crossmodal Audio Token Synchronization (2024) 8.35
- Lip2Vec: Efficient And Robust Visual Speech Recognition Via Latent-to-latent Visual To Audio Representation Mapping (2023) 6.77
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) 3.89
- LiteVSR: Efficient Visual Speech Recognition By Learning From Speech Representations Of Unlabeled Data (2023) 5.84
- SynesLM: A Unified Approach For Audio-visual Speech Recognition And Translation Via Language Model And Synthetic Data (2024) 0.00
- Large-scale Visual Speech Recognition (2018) 14.43
- VISinger2+: End-to-end Singing Voice Synthesis Augmented By Self-supervised Learning Representation (2024) 4.52
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differentiable Digital Signal Processing (2025) 0.00