LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading
2021 · Leyuan Qu, Cornelius Weber, Stefan Wermter
Abstract
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training on speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism, to map face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on \(\sim\)2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the i
Related papers
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025) 5.24
- RobustL2S: Speaker-specific Lip-to-speech Synthesis Exploiting Self-supervised Representations (2023) 4.52
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) 6.34
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) 3.89
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) 9.23
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019) 0.00
- LiRA: Learning Visual Speech Representations From Audio Through Self-supervision (2021) 11.58