DiVISe: Direct Visual-Input Speech Synthesis Preserving Speaker Characteristics And Intelligibility
2025 · Yifan Liu, Yu Fang, Zhouhan Lin
Abstract
Video-to-speech (V2S) synthesis, the task of generating speech directly from silent video input, is inherently more challenging than other speech synthesis tasks due to the need to accurately reconstruct both speech content and speaker characteristics from visual cues alone. Recently, audio-visual pre-training has eliminated the need for additional acoustic hints in V2S, which previous methods often relied on to ensure training convergence. However, even with pre-training, existing methods continue to face challenges in achieving a balance between acoustic intelligibility and the preservation of speaker-specific characteristics. We analyzed this limitation and were motivated to introduce DiVISe (Direct Visual-Input Speech Synthesis), an end-to-end V2S model that predicts Mel-spectrograms directly from video frames alone. Despite not taking any acoustic hints, DiVISe effectively preserves speaker characteristics in the generated audio, and achieves superior performance on both objective
Related papers
- DiffV2S: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding (2023)
- Audio-visual Video-to-speech Synthesis With Synthesized Input Audio (2023)
- VisageSynTalk: Unseen Speaker Video-to-speech Synthesis Via Speech-visage Feature Selection (2022)
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differential Digital Signal Processing (2025)
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)
- ReVISE: Self-supervised Speech Resynthesis With Visual Input For Universal And Generalized Speech Enhancement (2022)
- More Than Words: In-the-wild Visually-driven Prosody For Text-to-speech (2021)
- VITS2: Improving Quality And Efficiency Of Single-stage Text-to-speech With Adversarial Learning And Architecture Design (2023)