LipFormer: Learning To Lipread Unseen Speakers Based On Visual-landmark Transformers
2023 · Feng Xue, Yu Li, Deyin Liu, et al.
Abstract
Lipreading refers to understanding the speech of a speaker in a video and translating it into natural language. State-of-the-art lipreading methods excel at interpreting overlapped speakers, i.e., speakers who appear in both the training and inference sets. However, generalizing these methods to unseen speakers incurs catastrophic performance degradation, due to the limited number of speakers in the training bank and the evident visual variations caused by the shape and color of different speakers' lips. Therefore, depending merely on the visible changes of the lips tends to cause model overfitting. To address this problem, we propose to use multi-modal features across visual and landmark modalities, which can describe lip motion irrespective of speaker identity. We then develop a sentence-level lipreading framework based on visual-landmark transformers, namely LipFormer. Specifically, LipFormer consists of a lip motion stream, a facial landmark stream, and a cross-modal fusion. The embeddings from
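The abstract names a cross-modal fusion between a lip motion stream and a facial landmark stream but does not say how the fusion is computed. As a rough illustration only, the sketch below assumes a generic scaled dot-product cross-attention in which each visual frame embedding attends over the landmark embeddings, and the attended landmark context is concatenated to the visual embedding. The function names, the absence of learned projections, and the toy inputs are all assumptions made for illustration, not the paper's actual design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_fusion(visual, landmark):
    """Hypothetical fusion: visual frames query landmark frames.

    visual   -- list of T_v vectors (lip motion embeddings), each of size d
    landmark -- list of T_l vectors (landmark embeddings), each of size d
    Returns T_v vectors of size 2*d: [visual frame, attended landmark context].
    No learned projections are used; this is an assumption for brevity.
    """
    d = len(visual[0])
    fused = []
    for q in visual:
        # Attention weights of this visual frame over all landmark frames.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in landmark])
        # Weighted average of landmark embeddings (the attended context).
        context = [
            sum(w * v[i] for w, v in zip(weights, landmark))
            for i in range(d)
        ]
        fused.append(list(q) + context)
    return fused


# Toy inputs: three frames, two-dimensional embeddings per stream.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
landmark = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = cross_modal_fusion(visual, landmark)
```

Each fused vector keeps the original visual embedding in its first half, so a downstream decoder in this sketch would still see the raw lip motion features alongside the landmark context.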
Related papers
- Lipper: Synthesizing Thy Speech Using Multi-view Lipreading (2019) 10.61
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019) 0.00
- Learning Speaker-invariant Visual Features For Lipreading (2025) 0.00
- SimulLR: Simultaneous Lip Reading Transducer With Attention-guided Adaptive Memory (2021) 8.09
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) 11.39
- LipVoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) 3.89
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025) 5.24
- Learning Separable Hidden Unit Contributions For Speaker-adaptive Lip-reading (2023) 0.00