Learning Speaker-invariant Visual Features For Lipreading
2025 · Yu Li, Feng Xue, Shujie Li, et al.
Abstract
Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specifi
Related papers
- Learning Separable Hidden Unit Contributions For Speaker-adaptive Lip-reading (2023)
- Lipformer: Learning To Lipread Unseen Speakers Based On Visual-landmark Transformers (2023)
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Improving Speaker-independent Lipreading With Domain-adversarial Training (2017)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019)
- Selective Listening By Synchronizing Speech With Lips (2021)