TransFace: Unit-based Audio-visual Speech Synthesizer For Talking Head Translation
2023 · Xize Cheng, Rongjie Huang, Linjun Li, et al.
Abstract
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. This approach circumvents delays and cascading errors associated with model cascading. However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-t
Related papers
- Text-driven Talking Face Synthesis By Reprogramming Audio-driven Models (2023) · 2.26
- Translatotron 2: High-quality Direct Speech-to-speech Translation With Voice Preservation (2021) · 0.00
- From Faces To Voices: Learning Hierarchical Representations For High-quality Video-to-speech (2025) · 0.00
- Large-scale Multilingual Audio Visual Dubbing (2020) · 0.00
- See The Speaker: Crafting High-resolution Talking Faces From Speech With Prior Guidance And Region Refinement (2025) · 0.00
- Fluent And Low-latency Simultaneous Speech-to-speech Translation With Self-adaptive Training (2020) · 3.58
- TransVIP: Speech To Speech Translation System With Voice And Isochrony Preservation (2024) · 5.24
- Learning To Speak Fluently In A Foreign Language: Multilingual Speech Synthesis And Cross-language Voice Cloning (2019) · 15.03