DiffV2S: Diffusion-based Video-to-speech Synthesis With Vision-guided Speaker Embedding
2023 · Jeongsoo Choi, Joanna Hong, Yong Man Ro
Abstract
Recent research has demonstrated impressive results in video-to-speech synthesis, which involves reconstructing speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To resolve this issue, they have adopted an extra speaker embedding, extracted from reference audio, as speaking-style guidance. Nevertheless, it is not always possible to obtain audio corresponding to the video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and a prompt tuning technique. In doing so, rich speaker embedding information can be produced solely from the input visual information, and extra audio information is not necessary at inference time. Using the extracted vision-guided speaker embedding representations
Related papers
- DiVISe: Direct Visual-input Speech Synthesis Preserving Speaker Characteristics And Intelligibility (2025)
- VisageSynTalk: Unseen Speaker Video-to-speech Synthesis Via Speech-visage Feature Selection (2022)
- Diff-Foley: Synchronized Video-to-audio Synthesis With Latent Diffusion Models (2023)
- Audio-visual Video-to-speech Synthesis With Synthesized Input Audio (2023)
- Unsupervised Audiovisual Synthesis Via Exemplar Autoencoders (2020)
- VCVTS: Multi-speaker Video-to-speech Synthesis Via Cross-modal Knowledge Transfer From Voice Conversion (2022)
- ReVISE: Self-supervised Speech Resynthesis With Visual Input For Universal And Generalized Speech Enhancement (2022)
- More Than Words: In-the-wild Visually-driven Prosody For Text-to-speech (2021)