FluentLip: A Phonemes-based Two-stage Approach For Audio-driven Lip Synthesis With Optical Flow Consistency
2025 · Shiyan Liu, Rui Qu, Yan Jin
Abstract
Generating consecutive images of lip movements that align with a given speech in audio-driven lip synthesis is a challenging task. While previous studies have made strides in synchronization and visual quality, lip intelligibility and video fluency remain persistent challenges. This work proposes FluentLip, a two-stage approach for audio-driven lip synthesis, incorporating three featured strategies. To improve lip synchronization and intelligibility, we integrate a phoneme extractor and encoder to generate a fusion of audio and phoneme information for multimodal learning. Additionally, we employ optical flow consistency loss to ensure natural transitions between image frames. Furthermore, we incorporate a diffusion chain during the training of Generative Adversarial Networks (GANs) to improve both stability and efficiency. We evaluate our proposed FluentLip through extensive experiments, comparing it with five state-of-the-art (SOTA) approaches across five metrics, including a proposed