NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differentiable Digital Signal Processing
2025 · Yifan Liang, Fangkun Liu, Andong Li, et al.
Abstract
Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework integrating acoustic inductive biases with differentiable speech generation components. Specifically, we introduce a fundamental frequenc
Related papers
- RobustL2S: Speaker-specific Lip-to-speech Synthesis Exploiting Self-supervised Representations (2023) · 4.52
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) · 3.89
- NaturalSpeech: End-to-end Text To Speech Synthesis With Human-level Quality (2022) · 16.32
- NaturalSpeech 2: Latent Diffusion Models Are Natural And Zero-shot Speech And Singing Synthesizers (2023) · 0.00
- DiVISe: Direct Visual-input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025) · 5.58
- Let There Be Sound: Reconstructing High Quality Speech From Silent Videos (2023) · 6.34
- VITS2: Improving Quality And Efficiency Of Single-stage Text-to-speech With Adversarial Learning And Architecture Design (2023) · 12.40
- LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders (2022) · 0.00