Lip-listening: Mixing Senses To Understand Lips Using Cross Modality Knowledge Distillation For Word-based Models
2022 · Hadeel Mabrouk, Omar Abugabal, Nourhan Sakr, et al.
Abstract
In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Audio and audio-visual systems have made impressive progress in speech recognition. Nevertheless, much remains to be explored in visual speech recognition because some phonemes are visually ambiguous. The development of visual speech recognition models is therefore crucial, given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems; ii) leveraging audio data while training visual models, which has not been done in prior word-based work; iii) proposing Gaussian-shaped averaging in frame-level KD, as an efficient technique
Related papers
- Spatio-temporal Attention Mechanism And Knowledge Distillation For Lip Reading (2021)
- Audio-visual Representation Learning Via Knowledge Distillation From Speech Foundation Models (2025)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)
- Multi-grained Spatio-temporal Modeling For Lip-reading (2019)
- Target Speaker Lipreading By Audio-visual Self-distillation Pretraining And Speaker Adaptation (2025)
- Lipsound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021)
- Integrated Multi-level Knowledge Distillation For Enhanced Speaker Verification (2024)
- Lipformer: Learning To Lipread Unseen Speakers Based On Visual-landmark Transformers (2023)