Improving Lip-synchrony In Direct Audio-visual Speech-to-speech Translation
2024 · Lucas Goncalves, Prashant Mathur, Xing Niu, et al.
Abstract
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony, that is, ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
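The abstract does not spell out the loss, so the following is only a minimal sketch of the general idea, under stated assumptions: a distance-style synchrony score (in the spirit of LSE-D, where lower means better sync) is computed between time-aligned audio and lip embeddings and added to the translation loss with a weight. The function names, the plain Euclidean distance, and the `sync_weight` value are all hypothetical and are not taken from the paper.

```python
import math


def mean_embedding_distance(audio_embs, video_embs):
    """Average Euclidean distance between time-aligned audio and lip embeddings.

    Assumption: both inputs are equally long sequences of equal-sized vectors,
    e.g. produced by some pretrained audio-visual sync model (not shown here).
    """
    if len(audio_embs) != len(video_embs) or not audio_embs:
        raise ValueError("expected two non-empty, equally long sequences")
    total = 0.0
    for a, v in zip(audio_embs, video_embs):
        total += math.dist(a, v)  # distance for one aligned window
    return total / len(audio_embs)


def combined_loss(translation_loss, sync_distance, sync_weight=0.1):
    """Translation loss plus a weighted synchrony penalty (hypothetical weighting)."""
    return translation_loss + sync_weight * sync_distance


def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline - improved) / baseline
```

In a real training setup the distance would be computed on differentiable tensors so that gradients flow back into the speech generator; plain floats are used here only to keep the sketch self-contained.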
Related papers
- VisualTTS: TTS With Accurate Lip-speech Synchronization For Automatic Voice Over (2021)
- AV2AV: Direct Audio-visual Speech To Audio-visual Speech Translation With Unified Audio-visual Speech Representation (2023)
- High-quality Automatic Voice Over With Accurate Alignment: Supervision Through Self-supervised Discrete Speech Units (2023)
- SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision (2023)
- Incorporating Ultrasound Tongue Images For Audio-visual Speech Enhancement (2023)
- NaturalL2S: End-to-end High-quality Multispeaker Lip-to-speech Synthesis With Differential Digital Signal Processing (2025)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)