Spatio-temporal Attention Mechanism And Knowledge Distillation For Lip Reading
2021 · Shahd Elashmawy, Marian Ramsis, Hesham M. Eraqi, et al.
Abstract
Despite advances in audio and audio-visual speech recognition, visual speech recognition systems remain under-explored, largely because some phonemes are visually ambiguous. In this work, we propose a new lip-reading model that combines three contributions. First, the model front-end adopts a spatio-temporal attention mechanism to help extract informative data from the input visual frames. Second, the model back-end uses sequence-level and frame-level Knowledge Distillation (KD) techniques that allow audio data to be leveraged during training of the visual model. Third, the data preprocessing pipeline includes lip alignment based on facial landmark detection. On the LRW lip-reading benchmark, a noticeable accuracy improvement is demonstrated: the spatio-temporal attention, Knowledge Distillation, and lip-alignment contributions achieve 88.43%, 88.64%, and 88.37% accuracy, respectively.
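To make the distillation idea concrete, here is a minimal sketch of a generic temperature-softened distillation loss applied at two granularities: once per clip (sequence level) and averaged over frames (frame level). It is an illustration under stated assumptions, not the paper's exact formulation: the abstract does not give the loss, so the KL-divergence form, the temperature value, and the function names (`distill_loss`, `frame_kd_loss`) are all hypothetical. The "teacher" stands for an audio model and the "student" for the visual model.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Sequence-level sketch: one distribution per clip (assumed form).

    Compares the softened teacher (audio) and student (visual) outputs.
    The T^2 factor is the usual scaling for softened targets.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def frame_kd_loss(student_frames, teacher_frames, temperature=2.0):
    """Frame-level sketch: mean of the per-frame distillation losses."""
    if len(student_frames) != len(teacher_frames):
        raise ValueError("student and teacher must have the same frame count")
    losses = [
        distill_loss(s, t, temperature)
        for s, t in zip(student_frames, teacher_frames)
    ]
    return sum(losses) / len(losses)
```

In a training loop, a term like this would typically be added to the ordinary classification loss with a weighting factor; that weighting is likewise an assumption here and is not stated in the abstract.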
Related papers
- Lip-listening: Mixing Senses To Understand Lips Using Cross Modality Knowledge Distillation For Word-based Models (2022)
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019)
- SimulLR: Simultaneous Lip Reading Transducer With Attention-guided Adaptive Memory (2021)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models (2025)
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021)
- Lipreading With 3D-2D-CNN BLSTM-HMM And Word-CTC Models (2019)
- Improving Speaker-independent Lipreading With Domain-adversarial Training (2017)